Improve HTTP response handling and retry logic #520

Open
MichaelGHSeg wants to merge 18 commits into master from response-status-updates

Conversation

Contributor

@MichaelGHSeg MichaelGHSeg commented Feb 12, 2026

Summary

Implements improved HTTP response handling and retry logic to align with changes in analytics-java and analytics-next:

  • ✅ Add Authorization header (Basic auth with write key, OAuth Bearer token)
  • ✅ Add X-Retry-Count header on all requests (starts at 0)
  • ✅ Implement Retry-After header support (capped at 300s, doesn't count against retry budget)
  • ✅ Granular status code classification for retryable vs non-retryable errors
  • ✅ Replace backoff decorator with custom retry loop for better control
  • ✅ Exponential backoff with jitter (0.5s base, 60s cap)
  • ✅ Increase max retries from 10 to 1000 (to accommodate faster retry cadence)
  • ✅ Clear OAuth token on 511 Network Authentication Required
  • ✅ 413 Payload Too Large is non-retryable
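The backoff parameters above (0.5s base, 60s cap, with jitter) can be sketched as follows. The function name and the full-jitter strategy are illustrative assumptions, not necessarily the exact implementation in consumer.py:

```python
import random

BASE_DELAY = 0.5   # seconds (base)
MAX_DELAY = 60.0   # seconds (cap)

def calculate_backoff_delay(attempt):
    # Exponential ceiling, capped at MAX_DELAY, then full jitter:
    # delay is uniform in [0, min(cap, base * 2**attempt)].
    ceiling = min(MAX_DELAY, BASE_DELAY * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```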

Status Code Handling

Retryable 4xx: 408, 410, 429, 460
Non-retryable 4xx: 400, 401, 403, 404, 413, 422, and all other 4xx
Retryable 5xx: All except 501, 505
Non-retryable 5xx: 501, 505
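As a sketch, the classification above maps to a small predicate (set and function names here are illustrative, not necessarily the identifiers used in consumer.py):

```python
RETRYABLE_4XX = {408, 410, 429, 460}
NON_RETRYABLE_5XX = {501, 505}

def is_retryable_status(status):
    # 5xx are retryable except 501/505; 4xx are non-retryable
    # except the explicit allow-list; other codes are not retried.
    if 500 <= status < 600:
        return status not in NON_RETRYABLE_5XX
    if 400 <= status < 500:
        return status in RETRYABLE_4XX
    return False
```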

Retry-After Behavior

  • Respected for status codes: 429
  • Retry-After attempts do NOT count against backoff retry budget
  • Capped at 300 seconds maximum delay
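A minimal sketch of the parsing behavior described above, assuming integer-seconds values only (the actual parse_retry_after() may differ in detail):

```python
MAX_RETRY_AFTER = 300.0  # seconds, hard cap

def parse_retry_after(header_value):
    # Returns the capped delay in seconds, or None when the header
    # is absent or unparseable (HTTP-date is not handled here).
    if header_value is None:
        return None
    try:
        delay = float(header_value)
    except ValueError:
        return None
    return min(max(delay, 0.0), MAX_RETRY_AFTER)
```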

Test Coverage

  • Added 30 new comprehensive tests
  • Total test suite: 106 tests passing
  • Tests cover: Authorization headers, X-Retry-Count, Retry-After parsing, status code classification, backoff logic, network errors, retry budget separation

Changes

Modified Files

  • segment/analytics/consumer.py: Custom retry loop, status code classification, Retry-After support
  • segment/analytics/request.py: Authorization header, X-Retry-Count header, parse_retry_after()
  • segment/analytics/client.py: Increased default max_retries to 1000
  • segment/analytics/test/test_consumer.py: 13 new test cases
  • segment/analytics/test/test_request.py: 17 new test cases

Alignment

This implementation aligns with:

  • analytics-java commit c3f60c5 and 80cc33d
  • analytics-next commit 31e275d2

🤖 Generated with Claude Code

MichaelGHSeg and others added 2 commits February 11, 2026 19:20
- Add Authorization header with Basic auth (base64 encoded write key)
- Add X-Retry-Count header on all requests (starts at 0)
- Implement Retry-After header support (capped at 300s)
- Retry-After attempts don't count against backoff retry budget
- Add granular status code classification:
  - Retryable 4xx: 408, 410, 429, 460
  - Non-retryable 4xx: 400, 401, 403, 404, 413, 422
  - Retryable 5xx: all except 501, 505
  - Non-retryable 5xx: 501, 505
- Replace backoff decorator with custom retry loop
- Exponential backoff with jitter (0.5s base, 60s cap)
- Clear OAuth token on 511 Network Authentication Required
- 413 Payload Too Large is non-retryable
- Add 30 new comprehensive tests (106 total tests)

Aligns with analytics-java and analytics-next retry behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Aligns with analytics-java change to accommodate shorter backoff
periods (0.5s base, 60s cap). With faster retries, a higher retry
limit allows for better resilience during extended outages.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR improves HTTP response handling and retry logic in the analytics-python client, aligning with changes from analytics-java and analytics-next. The implementation replaces the backoff decorator with a custom retry loop that provides fine-grained control over retry behavior, distinguishing between Retry-After attempts and exponential backoff attempts.

Changes:

  • Replaced backoff decorator with custom retry loop implementing exponential backoff with jitter (0.5s base, 60s cap)
  • Added Authorization header support (Basic auth with write key and OAuth Bearer token)
  • Added X-Retry-Count header to track attempt numbers
  • Implemented Retry-After header support (capped at 300s, doesn't count against retry budget)
  • Granular status code classification distinguishing retryable from non-retryable errors
  • Increased default max_retries from 10 to 1000 to accommodate faster retry cadence
  • OAuth token clearing expanded to include 511 status code

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

Summary per file:

| File | Description |
| --- | --- |
| segment/analytics/request.py | Added Authorization header (Basic/Bearer), X-Retry-Count header, parse_retry_after() function, and response object to APIError |
| segment/analytics/consumer.py | Replaced backoff decorator with custom retry loop implementing exponential backoff, Retry-After support, and granular status code classification |
| segment/analytics/client.py | Increased default max_retries from 10 to 1000 |
| segment/analytics/test/test_request.py | Added 17 comprehensive tests for authorization headers, X-Retry-Count, Retry-After parsing, and OAuth token clearing |
| segment/analytics/test/test_consumer.py | Added 13 comprehensive tests for retry logic, status code classification, backoff behavior, and Retry-After support |


MichaelGHSeg and others added 6 commits February 12, 2026 18:07
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Extract duplicate exponential backoff calculation into helper function
- Add upper bound (max_total_attempts) to prevent infinite retry loops with Retry-After
- Improves code maintainability and prevents edge case of continuous Retry-After responses

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove 408/503 from Retry-After eligibility (only 429 uses Retry-After)
- Add rate-limit state to Consumer (rate_limited_until, rate_limit_start_time)
- 429 with Retry-After: set rate-limit state, raise to caller for requeue
- 429 without Retry-After: counted backoff (not pipeline blocking)
- Add maxTotalBackoffDuration / maxRateLimitDuration config (default 43200s)
- upload() checks rate-limit state before request(), enforces duration limit
- 511 OAuth gating: only retry when OauthManager is configured
- Add tests: T04, T17, T19, T20; update 429/408/503 behavior tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Handle Retry-After: 0 correctly by checking 'is not None' instead of
truthiness. Prevent silent batch re-queue on 429 without Retry-After
by gating the upload() re-queue path on rate_limited_until being set.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

segment/analytics/consumer.py:158

  • When a 429 with Retry-After is received (lines 136-144), the batch items are re-queued using queue.put(), but then in the finally block (lines 156-158), all items in the batch are marked as task_done(). This means the items that were re-queued are still marked as done, which could cause queue.join() to return prematurely before those re-queued items are actually processed. Consider either not calling task_done() for re-queued items, or handling the re-queueing differently.
            if e.status == 429 and self.rate_limited_until is not None:
                # 429: rate-limit state already set by request(). Re-queue batch.
                self.log.debug('429 received. Re-queuing batch and halting upload iteration.')
                for item in batch:
                    try:
                        self.queue.put(item, block=False)
                    except Exception:
                        pass  # Queue full, item lost
                success = False
            else:
                self.log.error('error uploading: %s', e)
                success = False
                if self.on_error:
                    self.on_error(e, batch)
        except Exception as e:
            self.log.error('error uploading: %s', e)
            success = False
            if self.on_error:
                self.on_error(e, batch)
        finally:
            # mark items as acknowledged from queue
            for _ in batch:
                self.queue.task_done()


MichaelGHSeg and others added 10 commits February 25, 2026 16:39
- Add KeyError to except clause when parsing JSON response to handle
  missing 'code' or 'message' keys
- Add explanatory comment on pre-existing except-pass pattern

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
requests.Response.__bool__() returns False for non-2xx status codes.
The checks `if e.response` and `if response` evaluated to False for
429 responses, so parse_retry_after() was never called and the SDK
fell back to normal backoff instead of respecting Retry-After.

Changed both checks to explicit `is not None` comparisons.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
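The pitfall this commit fixes can be reproduced with a minimal stand-in that mimics the documented requests.Response.__bool__ behavior (FakeResponse is purely illustrative; the real class delegates bool() to its .ok property):

```python
class FakeResponse:
    # Mimics requests.Response: bool(response) delegates to .ok,
    # which is False for any 4xx/5xx status code.
    def __init__(self, status_code):
        self.status_code = status_code

    def __bool__(self):
        return 200 <= self.status_code < 400

response = FakeResponse(429)
assert not response          # `if response:` silently skips the 429 branch
assert response is not None  # `if response is not None:` behaves as intended
```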
Add on_error handler to capture delivery failures from the SDK.
Reports success=false with the first error message when any batch
fails (non-retryable error or retries exhausted).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Align retry configuration with cross-library defaults.
Base backoff (500ms) and max backoff (60s) already matched.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
First request (retry_count=0) no longer includes the header. Retries
with retry_count > 0 continue to send X-Retry-Count: 1, 2, 3, etc.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
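A sketch of this header policy (the helper name and shape are hypothetical, not the actual request.py code):

```python
def build_retry_headers(retry_count, base_headers=None):
    # Include X-Retry-Count only on retries (retry_count > 0), so a
    # server can use the header's mere presence to detect a retry.
    headers = dict(base_headers or {})
    if retry_count > 0:
        headers['X-Retry-Count'] = str(retry_count)
    return headers
```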
Aligns with the analytics-next Node reference implementation which treats
all 200-399 responses as successful delivery. Previously only exact 200
was treated as success, causing 201/204/3xx to be misreported as errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Resolve python via activated venv/devbox, fall back to python3
- Use $PYTHON -m pip instead of bare pip (fixes macOS/nix where pip is not on PATH)
- Add devbox setup as recommended path in README, with venv fallback instructions
@MichaelGHSeg
Contributor Author

Deep Code Review — response-status-updates

Three parallel reviewer passes covering correctness/bugs, test coverage, and edge cases.


Critical / Must-Fix

1. backoff dependency not removed from setup.py and requirements.txt

import backoff was removed from consumer.py, but setup.py still declares "backoff~=2.1" as an install dependency, and requirements.txt still lists backoff==2.2.1. This keeps shipping an unused dependency to every user.

# setup.py — remove this line:
"backoff~=2.1",

2. None passed as max_total_backoff_duration / max_rate_limit_duration causes a TypeError

client.py accepts None for both parameters (the guard is if value is not None and value < 0), but consumer.py does an unguarded arithmetic comparison:

# consumer.py line 275 — blows up if max_total_backoff_duration is None:
if time.time() - first_failure_time > self.max_total_backoff_duration:

# consumer.py line 105 — same issue with max_rate_limit_duration:
if now - self.rate_limit_start_time > self.max_rate_limit_duration:

Either reject None explicitly in client.py (simplest fix), or guard the comparisons with if self.max_total_backoff_duration is not None and ....
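The guarded form could look like this (the helper name is hypothetical; None is taken to mean "no duration limit"):

```python
import time

def backoff_budget_exceeded(first_failure_time, max_total_backoff_duration):
    # Treat None as "no wall-clock limit" instead of letting the
    # `float - None` comparison raise TypeError.
    if max_total_backoff_duration is None:
        return False
    return time.time() - first_failure_time > max_total_backoff_duration
```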


Important / Should-Fix

3. max_total_backoff_duration=0 allows one extra attempt

The duration check uses a strict > comparison:

if first_failure_time is None:
    first_failure_time = time.time()       # set AFTER first failure
if time.time() - first_failure_time > self.max_total_backoff_duration:
    raise

When max_total_backoff_duration=0, the very first time through first_failure_time is set and then immediately checked, but virtually no time has elapsed, so the comparison evaluates false (~0 > 0). The retry sleeps for 0s and a second attempt is made before the duration check fires. The semantics of max_total_backoff_duration=0 should be "give up immediately after the first failure" — change > to >= or document the off-by-one.

4. Tight re-queue loop when Retry-After: 0 is sent

parse_retry_after returns 0 for a Retry-After: 0 header (due to max(delay, 0)). set_rate_limit_state then sets rate_limited_until = time.time() + 0, which is immediately in the past. upload() sees a non-None rate_limited_until, skips the sleep (because wait_time <= 0), calls request(), gets another 429+Retry-After: 0, re-queues, and loops — tight-spinning with no delay. The fix is to treat Retry-After: 0 as no-Retry-After (return None from parse_retry_after when delay == 0), or unconditionally floor the wait to at least some small epsilon before the sleep guard.
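A sketch of the first suggested fix, returning None when the parsed delay is zero so that Retry-After: 0 falls through to normal counted backoff (function shape is illustrative):

```python
MAX_RETRY_AFTER = 300.0  # seconds, cap

def parse_retry_after(header_value):
    # Proposed fix: treat a zero (or negative) delay the same as a
    # missing header, so the caller never sets rate_limited_until
    # to an instant that is already in the past.
    if header_value is None:
        return None
    try:
        delay = float(header_value)
    except ValueError:
        return None
    if delay <= 0:
        return None
    return min(delay, MAX_RETRY_AFTER)
```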

5. Queue re-queue on full silently drops items with no on_error call

for item in batch:
    try:
        self.queue.put(item, block=False)
    except Exception:
        pass  # Queue full, item lost

When queue.put raises because the queue is full, the item is silently discarded — no log, no on_error callback. Users will lose events without any indication. At minimum, add a log at error level and call self.on_error if set:

except Exception:
    self.log.error('Queue full after 429 re-queue; item dropped: %s', item)
    if self.on_error:
        self.on_error(Exception('Queue full, item dropped after 429 re-queue'), [item])

6. flush() can block for up to max_rate_limit_duration (12 hours default)

When a 429+Retry-After is received, the batch is re-queued. queue.join() (called by flush()) waits until every put has a corresponding task_done(). If the server keeps rate-limiting, flush() will block for up to max_rate_limit_duration (12 hours by default). This is intentional behavior but has major operational consequences that should be documented in the flush() docstring and in the README/CHANGELOG.

7. is_retryable_status returns True for 410 Gone

HTTP 410 means the resource has been permanently removed. Unlike 408 (timeout) or 429 (rate limit), retrying a 410 will never succeed with the same payload. Including it as retryable is semantically incorrect — it will exhaust the retry budget on a request that cannot succeed. Please confirm whether this is intentional (e.g., Segment's API uses 410 for a transient/recoverable purpose) and add a comment if so.


Minor / Nice-to-Have

8. FatalError catch is non-obvious without reading oauth_manager.py

consumer.request() has an explicit except FatalError branch, but FatalError is never raised within post() itself — it surfaces only from oauth_manager.get_token() via the OauthManager._poller_loop thread. This is correct but hard to follow. A comment would help:

except FatalError as e:
    # Raised by OauthManager.get_token() after exhausting OAuth token retries
    self.log.error(f"Fatal error after {total_attempts} attempts: {e}")
    raise

9. Retry-After HTTP-date format silently falls back to counted backoff

parse_retry_after returns None for an HTTP-date format Retry-After header (RFC 7231 allows both integer seconds and HTTP-date). When this happens, a 429 with an HTTP-date Retry-After is treated as "no Retry-After" and falls through to the counted backoff path — consuming the retry budget — rather than the pipeline-blocking path. At minimum, log a warning:

except ValueError:
    log.warning('Unparseable Retry-After header value: %r (HTTP-date format not supported)', retry_after)
    return None
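If HTTP-date support were ever wanted, the standard library can do the parsing; a hedged sketch, not part of this PR (function name hypothetical):

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def retry_after_to_seconds(value):
    # Accept either delta-seconds or an RFC 7231 HTTP-date.
    # Returns seconds to wait (>= 0), or None if unparseable.
    try:
        return max(float(value), 0.0)
    except (TypeError, ValueError):
        pass
    try:
        target = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # TypeError on Python < 3.10, ValueError on 3.10+
        return None
    if target.tzinfo is None:
        target = target.replace(tzinfo=timezone.utc)
    return max((target - datetime.now(timezone.utc)).total_seconds(), 0.0)
```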

10. Fragile task_done() separation between early-return path and finally

The rate-limit-exceeded early-return (lines 99-119) manually calls task_done() for each item before returning, relying on the fact that the try/finally at line 130 is never entered. This creates a maintenance trap — if a future refactor moves the rate-limit check inside the try block, task_done() would be called twice per item, causing queue.join() to return prematurely. Add a comment or refactor to use a single code path.

11. Thread name not set on Consumer threads

Consumer threads are daemon threads but have no name attribute set, making them hard to identify in stack traces and debuggers. Consider adding self.name = f'segment-consumer-{write_key[:8]}' in __init__.

12. Duplicate backoff code in request() for APIError and generic Exception

The except APIError and except Exception branches in request() contain identical backoff logic (check first_failure_time, check max_total_backoff_duration, increment backoff_attempts, call calculate_backoff_delay, time.sleep). Extract into a helper to avoid drift:

def _handle_retryable_failure(e):
    nonlocal first_failure_time, backoff_attempts
    if first_failure_time is None:
        first_failure_time = time.time()
    if time.time() - first_failure_time > self.max_total_backoff_duration:
        raise
    backoff_attempts += 1
    if backoff_attempts >= self.retries + 1:
        raise
    delay = calculate_backoff_delay(backoff_attempts)
    self.log.debug(...)
    time.sleep(delay)

Test Coverage Gaps

| Gap | Severity |
| --- | --- |
| max_total_backoff_duration=None — triggers TypeError in consumer | Critical (matches issue #2 above) |
| max_rate_limit_duration=None — triggers TypeError in consumer | Critical (matches issue #2 above) |
| Queue full during 429 re-queue — items silently dropped | Important (matches issue #5) |
| Retry-After: 0 tight-loop behavior | Important (matches issue #4) |
| Retry-After with HTTP-date format falls back to counted backoff | Nice-to-have |
| Multi-consumer (thread=2) rate-limit state is not shared | Nice-to-have |
| sync_mode has no retry logic — failures propagate directly | Documentation gap |
| retries=0 with a retryable (non-429) error | Minor |
The happy-path and primary retry scenarios are very well covered. The new test suite is a significant improvement.


Positive Notes

  • The manual retry loop replacing backoff.expo gives much better control over retry semantics. The dual-budget model (per-attempt count + wall-clock duration) is solid.
  • Separating 429+Retry-After (pipeline blocking) from 429-without-Retry-After (counted backoff) is the right design.
  • X-Retry-Count header omitted on first attempt is correct — servers can use presence of the header to detect retries.
  • parse_retry_after capping at 300s is a good safety valve.
  • The on_error callback hookup in e2e-cli is a real improvement for observability in e2e tests.
  • The Basic auth migration from HTTPBasicAuth to explicit base64 is fine (equivalent behavior, removes one import).
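For reference, the explicit base64 form is equivalent to HTTPBasicAuth(write_key, ''); a sketch (the empty-password convention is an assumption based on the usual Basic auth scheme for write keys):

```python
import base64

def basic_auth_header(write_key):
    # base64("<write_key>:") with an empty password — the same bytes
    # requests' HTTPBasicAuth(write_key, '') would send.
    token = base64.b64encode(f'{write_key}:'.encode('utf-8')).decode('ascii')
    return {'Authorization': f'Basic {token}'}
```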

